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ABSTRACT 

The discretization of separable elliptic partial differential equations leads to lin- 
ear systems with special block tridiagonal matrices. Several methods are known 
to solve these systems, the most general of which is the Block Cyclic Reduc- 
tion (BCR) algorithm which handles equations with nonconstant coefficients. A 
method was recently proposed to parallelize and vectorize BCR. In this paper we 
discuss the mapping of BCR on distributed memory architectures and compare 
its complexity with that of other approaches including the Alternating-Direction 
method. We also describe a fast parallel solver, based on an explicit formula 
for the solution, which has parallel computational complexity lower than that of 
parallel BCR. 

Keywords: Block cyclic reduction algorithm, parallel processing, partial fractions, 
hypercube computers, computational complexity, alternating direction algorithm. 

1. Introduction. In this paper we are concerned with efficient parallel 
methods for solving block tridiagonal systems that arise from the discretiza- 
tion of the general separable elliptic equation 


+d ( y )~d^ + € ^dy + 9 ^ U = 

These systems are of great importance particularly because their solution 
may be required in the inner loop of an iterative procedure, in precondition- 
ing more complex systems, or in the content of time-stepping techniques. 
When (l.l) is of Poisson type in one direction and is defined on a domain 
which allows separation of variables to be used ([27]), there exist special fast 
methods which for N unknowns achieve sequential complexity of 0(N log N) 
([4,5,11,26]) and parallel computational complexity of 0(logN) ([6,18,25]). 

We concentrate on methods which in principle do not rely on fast trans- 
forms and can thus be used to handle discrete equations as general as (1.1). 
We describe their mapping on parallel architectures and investigate their 
computational and communication complexities. The methods we discuss 
are block cyclic reduction (BCR), alternating direction implicit procedure 
(ADI), and a new explicit elliptic solver (EES). 

Parallel BCR was recently introduced in [7] and [25]. Its implementa- 
tion and performance were discussed for the case of the Alliant FX/8 shared 
memory vector multiprocessor in the former and for the Cray-1 in the lat- 
ter. A very brief discussion of the mapping of the algorithm on hypercubes 
and multicluster shared memory architectures can be found in [6,23] and [9] 
respectively. We outline the algorithm and its parallelization in Section 2 
and discuss its mapping on distributed memory architectures, particularly 
on hypercubes, in Section 2.1. Several mapping strategies are considered, 
depending on the size of the problem and the number of processors and 
their parallel arithmetic and communication complexities are discussed. In 
Section 2.2 we describe an implementation of BCR for massively parallel ar- 
chitectures. If scalar cyclic reduction is used to solve the tridiagonal systems 
then the parallel computational complexity of BCR is O(lognlogm), for a 
block tridiagonal system of n blocks each of dimension m. (A = nx m). This 
is inferior to the O(lognm) complexity associated with FFT based methods. 
In Section 3 we introduce an O (log nm) parallel algorithm which we call 
Explicit Elliptic Sover (EES). This algorithm affords the same generality as 
BCR and is very simple to implement. Nevertheless, it may be impractical 
due to its requirement of a very large number of pr ocessors. In Section 4 we 
briefly discuss the use of these methods for the most general separable prob- 
lem. In Section 5 we examine ADI methods. Although, strictly speaking, 
these are iterative techniques, their sequential complexity to achieve a level 
of accuracy that is comparable with discretization errors, is of the order of 
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0(n 2 log 2 n) when m — n. For an n 2 processor hypercube connected system, 
their parallel complexity is of order 0(log 3 n). Moreover, a nonnegligible 
advantage is that they are far easier to implement than the BCR schemes. 
Finally in Section 6 we provide some concluding remarks. 


2. Block Cyclic Reduction. Consider the more restricted form of 
Eq. (1.1) 


( 2 . 1 ) 


a(a:) l^ + 6(a;) ^ + c(a:)u + 



/(*>y) 


with Dirichlet boundary conditions on a rectangle. If we discretize with 
a 5 point stencil on a naturally ordered n x m grid we obtain the block 
tridiagonal system: 


(2.2) 
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A and I are tridiagonal and unit matrices respectively, of order m. The 
vectors Vj and fj are of order m. 

The BCR algorithm for solving systems in the above form was introduced 
as early as 1965 by Hockney ([ll]). The basic idea of the first step is to 
combine every block-row of even number with the two adjacent block rows so 
as to eliminate the odd (block) variables. The process is repeated until only 
one block variable remains. A back-substitution is then applied to compute 
all the unknowns. In its direct application the method is unstable, but a 
stabilized version was introduced by Buneman ([3]) at a cost of increased 
complexity. The first descriptions of the method assumed that the number 
of blocks n = 2 k - 1 ([4]). The extension of the method to any n is due to 
Sweet ([26]). For a short history of BCR we refer to [21]. The sequential 
complexity of the method is 0(nm log n). 

Let us rewrite (2.2) as 


(2.3) Av = / 

and assume, to simplify the notation, that n = 2 k — 1. In the r-th reduction 
step of BCR, r = 1, . . . , Jfe — 1 , the current 2 k ~ r+1 - 1 right-hand-side vectors 
are combined into 2 k ~ r - 1 ones, producing a system of the form 

= / {r) 
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in which. is a block-tridiagonal matrix of block dimension 2 1 

whose diagonal blocks are all equal to a matrix A^ and whose co-diagonal 
blocks are equal to -I. The matrix A < T ) can be expressed as a Chebyshev 
polynomial of degree 2 r_1 in A , which we write as A In 
Buneman’s version of the algorithm, the explicit computation of the right- 
hand-sides / of the reduced systems is avoided by introducing auxiliary 
vectors p and q which are defined through the solution to a system of the 
form 


A^X r = Y r 


where Y r E Since the roots A, M of p^-i are known, A can 

be written in product form, where each factor is a tridiagonal matrix of the 
form A — A similar strategy is used for the back-substitution phase. 

For completeness we next describe the Buneman algorithm. 


Algorithm: BCR, Buneman’s version 

A. Initialize: p\° ^ = 0 ,q^ — fj,j = 1» — , n nnd h = 1, r = 0. 

B. Forward solution: ^ 

1. Form the matrix Y r with columns q { 2 y h + P( 2 j_i)h + P( 2 j+i)h' 


j = 1, . . . , (n + l)/2h — 1 

2. Solve the (multi)- linear system A^X T = Y r 

3. Update the vectors p and q according to 


(2.4) jft 1 ’ = )■ = !,. 


,2 


fc— r— 1 


- 1 


(0 o > +1 ) - 2o (r+1) + <7 (r) 
( 2 . 5 ) q 2 j h — Zp 2jh + 


j = 1 2 fc-r_1 - 1 


4. If h<n then h = 2 h, r = r - H; go to 1. 

C. Solve for u: A^u — q[ r ^ and set = Ph + ti. 

D* Backward substitution: while h > 1 do 

1. h = h/2 

2. Form the matrix Y r with column vectors + v (j+i)/n 

j — 1? 3, 5, . . . , n/h. 

3. Solve the (multi)- linear system A^U r = Y r 

4. Update the solution vectors Vjh,j = 1, 3 , • ■ ■ ? by V r = P r + U r , 
where V r (resp. P r ) is the matrix with vector columns Vjh ( resp. 

Pjh )■ 

In (2.4), the vector X r e 3 is the j-th column of the matrix X r . As was 
mentioned above, since the roots of p^-i are known the systems in B.2 can 
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□ 

□ □ 



□ □□□ 000 

□□□□□□□□ 0 

FlG. 1. Serial Block Cyclic Reduction for n = 31. 


be written 

2 T— 1 

(2.6) n<*- A S-x J )[*il • • • l z 2 fc -’ , -i] = M • • ■ l»a*-'-i]- 

i=l 

Clearly, as r increases the effectiveness with which a parallel or vector com- 
puter can handle (2.6) decreases rapidly since the number of right hand 
sides available decays geometrically. Figure 1 depicts the rapid increase in 
the sequential factors (the blocks on the left) in contrast to the equally rapid 
decrease in the number of independent systems (the vectors on the right) 
for n = 31. The parallel version of BCR is based on expressing the matrix 
function [p 2 ,-i(A)]- 1 as a partial fraction, i.e. as a linear combination of the 

2 r_1 components (A - 

(2.7) [a:i| • • • a i r) ( A ~ A i-i J ) _1 foil ' ' ' llfa*— -il- 

1 = 1 

The coefficients aj r) are equal to 1 /j/ 2 ,_i(a£\) and can be derived analyti- 
cally. The cases of Neumann and periodic boundary conditions can also be 
handled similarly. Figure 2 depicts the parallel reduction for n = 31. When 
the number of blocks n is not equal to 2 k - 1, additional systems of the form 

h l 

(2.8) JJ(A - Ai_il)* = XI( A “ 

;=i i= l 

must be solved at each step. In this case the rational matrix polynomial 

(2.9) HU" W- lJ ) • I II 

j=i i=i 

is also reduced into a sum of partial fractions ([8]). 
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□ Q! 
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□ODD 
□ Oi 




□ OOQ + DODO + DODO + DODO 

□o + nD + no + no + no + no+no + no 

Fig. 2. Parallel Block Cyclic Reduction for n -- 31. 


The same technique can be applied for general separable equations fol- 
lowing the algorithm in [22], the main difference being that the roots of the 
polynomial are not known beforehand in an analytic form. Standard meth- 
ods can be used for the numerical computation of these roots, followed by 
the technique described earlier. We note that partial fraction expansions for 
BCR were also independently advocated in [25]. They have also been used 
in a similar context by Swayne [24] who was not, however, motivated by 
parallel computing. Their use in a different context for parallel processing 
was advocated by II. T. Kung [16]. 

2.1. Block Cyclic Reduction on hypercubes. Distributed memory 
machines based on hypercube networks represent an excellent compromise 
between fully connected arrays and grid arrays and have recently been devel- 
oped into several commercial products. A p-cube network or p-dimensional 
hypercube, consists of P — 2 P identical nodes that are interconnected to each 
other in such a way that each node has p neighbors. The rigorous description 
of the interconnection involves a binary numbering of the nodes: two nodes 
are connected if and only if their binary numbers differ in one and only one 
bit. Thus, for p = 3, a p-cube can be represented as an ordinary cube in 
three dimensions where the vertices are the 8 = 2 3 nodes of the 3-cube. For 
p — 4 one can represent the hypercube network as shown in Figure 3. 

In this section we consider several mappings of the BCR algorithm on 
hypercubes. Sections 2.1.1 and 2.1.2 deal with the limited processor case, 
in that the number of processors P is assumed to be smaller than m in the 
former and n in the latter. Section 2.2 deals with the case P > N. 

An important notion in hypercube architectures is that of Gray-codes. 
A Gray code of order p is simply a sequence go, ... ,g 2 P-i, of all the p-bit 
binary numbers such that two consecutive elements of the sequence differ in 
exactly one bit, i.e., H(gi, fft+i) = 1? where H{x,y) denotes the Hamming 
distance between x and y. In particular, the binary reflected Gray code 
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sequence of order p, see [17] has the remarkable additional property that 
= 2, for j > 1. In the remainder of this paper we will use the 
term Gray code to refer to binary reflected Gray codes. 

2.1.1* Interleaving right hand sides vertically across proces- 
sors. We first describe a mapping of the data that leads to a simple and 
efficient algorithm for the case m P. Considering each of the subvectors 
fj of the right-hand-side /, we assign its / — th component fjj to processor 
gi_i for / < P and more generally, 

MapI: Assign component fjj to node g mo d(l-x y 2 *)• 

Thus the processor labeled gi_\ will hold all the components l + vP , 
where / + vP < m, of each subvector fj of /. A consequence of this map- 
ping is that when solving the simultaneous tridiagonal systems with Gaus- 
sian elimination, the forward and backward sweeps will only require nearest 
neighbor communication. 

In the very first step of block cyclic reduction, there are — 1 tridiag- 
onal linear systems to solve with the same matrix A, and different right hand 
sides, /2,/4>— i/2(2 fc -*-i)- As is illustrated in Figure 4 each of these inde- 
pendent tridiagonal systems is interleaved across the P processors. These 
linear systems can be solved by pipelined Gaussian elimination, which we 
now describe. 

The first elimination step consists of communicating the first row and top 
element of the first right hand side down from processor <70 to its neighbor 
g\ which then performs the forward elimination step. In the second step 
processor g\ sends the second row of the first right-hand-side to processor 
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FlG. 4. Interleaving assignment of three right-hand-sides across four processors . 


<72 while processor go will send the first element of the second right hand 
side down to g \ . Then g\ performs the first elimination step for the second 
right hand side and <72 the second step of elimination for the first right hand 
side. Note that here g\ does less computation than g*i< For later reference, 
we display in Figure 5 the main loop involved in the solution of a system 
with the tridiagonal matrix whose nonzero components in the ith row are 
b(i),d(i),c(i). 

The pivots z may be saved in place of b(i) since they will be needed for 
the other right hand sides. The remainder of the process is similar. Initially 
most of the processors are idle but after the first P elimination steps, every 
processor becomes active and the granularity of the tasks increases. When 
the forward elimination is completed, the backward elimination is performed 
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c Forward solve: 
do i=2, m 

z * b(i)/d(i-l) 
d(i) = d(i) - z*c(i-l) 
y (i) = y(i) - z*y(i-l) 

enddo 

c Backward solve 
y(m) = y(m)/d(m) 
do i = m-1, 1, -1 

y(i) = (y(i) - c(i)*y(i+l))/d(i) 

enddo 


FlG. 5. Gaussian elimination for tridiagonal system . 


in the same way but backwards. 

In subsequent steps of the parallel version of Buneinan’s algorithm de- 
scribed in Section 2, we have several tridiagonal systems to solve with the 
same right hand sides but different tridiagonal matrices. For simplicity we 
drop the superscript index (r) from the A’s; it is understood however that 
the set of A’s changes at each step of block cyclic reduction. Thus, in the 
second step of the forward reduction, we end up with systems of the form 
(A — Ao/)(A — Ai I)X = Y where this time the number of right hand sides 
in Y is 2 k ~ 2 — 1, or nearly half what it was in the first step. One pos- 
sibility is to proceed sequentially with respect to the A’s, i.e., we can use 
the pipelining procedure described above to solve (A - Ao I)Z = Y and 
then ( A — Ai I)X — Z . However, towards the end of the forward elim- 
ination, the number of right hand sides decreases and a better alterna- 
tive is to use the parallel technique based on partial fractions, described 
in Section 2.1. This amounts to solving the independent tridiagonal sys- 
tems ( A — \qI)Xq = Y and (A - \iI)Xi = Y and then computing a linear 
combination of X;,z = 0, 1. It is interesting to observe that from the point 
of view of implementation, one can consider that we are solving altogether 
twice as many tridiagonal systems independently. The right hand sides can 
therefore be duplicated as many times as there are A’s and we are back to 
the situation of step one except that not all of the distinct right hand sides 
must be solved with the same tridiagonal matrix. For example, in step 2 we 
will have to solve 2[2 k ~~ 2 — 1] independent tridiagonal systems, half of which 
involve the same matrix (A - A 0 /) and the other half the matrix (A - A } J). 
The pipelined procedure described for the first step can be used. 

Let us now estimate the time that it takes to perform the r-th step of 
the forward elimination. For reasons that will be explained later we will 
assume that m > n. We will use the following standard and simple model. 
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To move a data set of j words from one processor to a neighbor takes a time 
of 

(2.10) /? + jr 

seconds, wffiile performing j arithmetic operations in vector mode, 

(2.11) 7 + jw 

seconds. Note that this model includes the case where each processor is 
a scalar processor, by taking 7 = 0. We w T ill often refer to Buneman’s 
algorithm described in Section 2.1. At the start of every step, the algorithm 
requires forming the matrix Y r of right-hand-sides. Each column vector yj of 
Y r is a linear combination of the vectors q * s and p’s in the standard notation 
of Buneman’s algorithm and they are obtained by the formula, 

(2-12) Vj - %2 jh +P(2i-l)fc + P(2j+l)/i 3 ~ 

Clearly, this requires no communication since a given processor contains the 
same components of each of the vectors to be combined. The time for this 
is approximately, 

r m.. 

2 7 + 2[— ]w. 

The next phase of the elimination step consists of solving 2 r ~ 1 tridiag- 
onal systems each of dimension m and having 2 k ~ r ~ 1 different right hand 
sides arranged in interleaved order. Following our previous discussion, we 
will assume that we must solve exactly q = 2 r ~*( 2 fc ~ r — l) tridiagonal sys- 
tems each with a different right hand side and a different diagonal. Clearly, 
this is not quite accurate since many of the matrices are identical as was 
seen above, but it gives an upper bound for the time estimates that is much 
simpler to derive. 

In pipelined algorithms there is a pipe -fill time which corresponds to 
the first few steps before all processors reach their high regime of efficiency. 
The first P - 1 steps see processors goy^gp-i becoming gradually busy 
each doing one elimination step consisting of reading three floating point 
numbers from a previous processor and performing forward elimination at a 
total cost of 

(P- i)[j9+"3r + 37 + 5w] 

Similarly, in the next P steps all processors will be busy but they will deal 
with two elements of the right hand side instead of just one. The new cost 
is - 

P{(3 + 6r -f 67 -f 10u?] 
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It is only after the first component of the last right hand side is processed, 
i.e., after step g, that the processors will start doing essentially the same 
work at every new step. If we let s = [g/P], then the total time that it 
takes before this is achieved is approxirrni <dy, 

* $ s 

(2.13) ^ P[j3 + 3 jr + 3 7 + bju>] * Ps[(i + 37 + 3 -r + 5-u>] 

j=i 

Note that a similar phenomenon takes place in reverse in the last q steps. 
Apart from these first and last q steps each of the remaining m — 2q steps 
of the algorithm, takes the same amount of time which is approximately, 
ft + 37 + 3 sr + 5 su) and the total becomes 

(2.14) 2 Ps[(3 + 7 + 3^t + 5^u>] + (m - 2 q)[Q + 37 + 3 st + 5w] 

£ £ 

Note here that we need to have m - 2 g > 0 which, using the fact that q is 
a decreasing function of r leads to m > 2 k - 2 — n — 1 or m "> n y which 
justifies our earlier assumption. 

The partial fraction expansion formula (2.7) requires that we now take 
a linear combination of 2 r_1 matrices of 2 k ~ T — 1 columns of length m each. 
These matrices are split equally among the processors and the time required 
for this linear combination is approximately, 

2 T ~ l (2 k ~ T - 1)(7 + = <?( 7 + 

assuming that the linear combinations are performed column- wise. Note 
that once more no communication is required. 

Finally, we need to update the p and q vectors according to (2.4), (2.5). 
Again this requires no interprocessor communication and the arithmetic time 
is approximately, 7 4 f° r (2*4) and 27 + 2 |"^~|u; for (2.5). 

Adding up all these times, and separating the communication from arith- 
metic complexities we obtain 

(2.15) T cornm » ( 2 Ps + m - 5g)/3 + 3 s(Ps + m - 2 g)r 
for communication and 

(2.16) Tarith « ( 2 Ps 4 - 3 m - 5g)7 

4 - (5 Ps 2 + 5(m - 2 q) 4- {q + 5)[— ])w 

for arithmetic. 

Note that since q = 2 k ~ l - 2 r "\ for q large relative to the number 
of processors, Ps will be of the order of q which implies that the number 
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of start-ups is of the order of 2q — 2 k - 2 r in both communication and 
arithmetic. Apart from the start-up coefficients, the algorithm is dominated 
by the 0(Ps 2 ) terms in both communication and arithmetic, which are of 
the same order as q 2 /P in a typical situation. 

When the number of right hand sides is small, blocking can be used to 
reduce the effect of start-up overhead. Going back to the original version of 
the algorithm, we notice that at every step only a few arithmetic computation 
and a few data transfers are performed at every step. If each processor 
has a vector processing capability then this may be inefficient. Similarly, 
many machines have high latency times in communication and it is always 
preferable when possible to transfer a sizable amount of data at once rather 
than just a few. The remedy to this is to use a simple and standard scheme in 
pipelining which consists of blocking the computations. Instead of treating 
only one right hand side at a time, we can deal with a group of v right hand 
sides simultaneously at each time. Then each elementary step of Gaussian 
elimination can be performed as a vector operation across the u right hand 
sides. 

2.1.2. Horizontal distribution of the right hand sides. In this 
section we consider another implementation of BCR in which the right hand 
side vectors are not distributed vertically, but horizontally, i.e., a whole right 
hand side (vector) is now assigned to the same processor instead of being 
split and shared among several ones as in the previous subsection. Again 
we consider a hypercube system with P = 2 P processors and assume that 
P < n, where n is of the form n = 2 h — 1. 

For the purpose of illustration Jet us consider the simple case where 
n — 15 and the number of processors P — 2 P is 8. We start by mapping 
two right hand sides per processor except for the last processor which will 
only hold one right hand side. We use the Gray code mapping of the right 
hands sides /i, / 2 , / 15, at the rate of two vectors per node, which consists 
of assigning f u f 2 to node 000, / 3 , f 4 to node 001 , /2*+i, /2*+2 to node gi 

where g 0 , Si, ...,92r-i I s the standard binary reflected Gray code sequence. 
As is easily seen in this example, when combining three successive vectors 
/21-I5 /215 /2»+i as in the first step of forward reduction, we only need near- 
est neighbor communication. After these linear combinations are completed, 
each processor solves a tridiagonal system involving A, In the later reduction 
steps, subvectors with subscripts differing by a power of two are combined. 
Because of a well-known property of the binary reflected Gray code, one 
observes that communication is kept at distance of exactly two. It is impor- 
tant to notice that after a certain number of steps, in our example after just 
the first step, many processors will have no right hand sides while others 
will have exactly one right hand side. In our example, in step 2, processors 
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and ye will have one right hand side to solve with a system which is a 
degree 2 polynomial in A. This raises the question of how to put the inactive 
processors to work, which will be addressed shortly. 

Consider the situation in the general case. At each step of the reduction 
process there are 2 k ~ T - 1 right hand sides involving a polynomial of degree 
2 r_1 for r = l,...,logn — 1. Once these systems are solved, the resulting 
vectors are combined in groups: the vectors corresponding to indices i — 
2 r ~ 1 i i ) j+ 2 r_1 are added together. This gives a new set of independent 
systems with half the number of unknowns. With the exception that we are 
now dealing with vectors instead of scalars, the combination step is as in 
scalar cyclic reduction. A systematic way of mapping the right hand sides 
to the P processors is to start by partitioning the Gray code representation 
hi of the index i of each right hand side (vector), for 1 < i + 1 < n, as 

F,(i) 

< : — ■> . 

(2.17) h{ = i*_i • • • ifc_p ifc_p_i — *o 

As shown, for any integer t, 0 < x < 2* — 1, we denote by F p (i) the binary 
number formed by the p leading bits of hi , the i-th element of the Gray 
code sequence of dimension k. Hence F p (i) can serve as an identifier tag for 
P — 2 P processors. Then, the mapping rule used is as follows 

Map2: Assign component fjj, l = 1, 2, • • -,Tn to node F p (j - 1). 

We then obtain the following theorem which can be used both in scalar 
and block cyclic reduction. 

THEOREM 2.1. If n — 2* — 1 and rule Map2 is in effect in the initial 
assignment of equations to processors for cyclic reduction, the elements to 
be combined at each step belong to nodes which are at a maximum distance 
of 2 apart on the hypercube graph. 

Proof. We discuss the case of BCR. The same arguments can be used to 
prove the result for scalar cyclic reduction. Consider block index i + 1 with 
i = a2 k ~ p + l, where 0 < / < 2 k ~ p and a = 0,...,2 P - 1. At step r, each 
vector i = 2 r p with p = 1, . . . , 2 fc-r - 1 is combined with vectors i - 2 r_1 and 
i -j- 2 r_1 . According to Map 2, the equation t + 1 is in processor F p (i) = g a - 
When 2 r <i + 2 r_1 < 2 k - 1, the equations i + 2 r_1 and x - 2 r_1 are in the 
same processor because the p leading bits of their indices are identical. It is 
only when r > k — p that any of the vectors in g a would be combined with 
elements in a processor other than g a , g a + 1 or y a -i- Prom the Gray code 
assignment, processors g a and g a ±i are adjacent. When r > k — p, we write 
r = k - p+ t for t>l and combine x with x ± 2 fc_p+t , i.e. 

a2 k ~ p + l ± 2 k ~ p+t = 2 k ~ p {a ±2 ‘) + / 

By definition these lie in processors g a ± 2 «. From a fundamental property 
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of binary reflected Gray codes these processors are at distance 2 away from 
processor g a and the theorem is proved. ■ 

We distinguish two regimes in the algorithm. With n = 2 k - 1 and 
P — 2 P , the first regime is when the step number r does not exceed k — p, in 
which case each processor ends up with 2 k ~ r /2 P = 2 k ~ r ~ p right hand sides 
except for the last processor which will have one less right hand side. The 
degree of the polynomial is 2 r_1 . Thus, the work is well balanced, at the 
exception of the slight difference with the last processor. More precisely, 
each processor will have a linear system of the form p 2 r (A)X == Y to solve 
and there is no need to use partial fractions in order to load balance. In 
contrast, in the second regime, i.e., as soon as r > k - p, each processor of 
the form J = F p (j 2 r ) ends up with exactly one right hand side while any 
other processor will not have accumulated any right hand side and would 
remain idle if no counter action is taken. However, the idea of partial fraction 
expansions described earlier can be employed to achieve load balancing when 
solving the polynomial systems ( A - A 0 I)(A - A iI)...(A - X 2 r-i_il)x = y, 
of degree 2 r “ 1 that are generated at the r th reduction step in nodes labeled 
F p (j x 2 r ),jf =■ 1,...,- Again we start by illustrating the process with the 
particular case n — 15 and P = 8; i,e M k = 4,p = 3. After the first 
elimination process, all processors must solve a linear system of the form 
(A — XqI)x = y where y consists of one right hand side, and this constitutes 
the only step of the first regime where there is no need to load balance. 
In the second step, processors F 3 (2) = 001,F 3 (4) = 010 and F 3 (6) = 111, 
will have to solve each a system of the form (A - A 0 /)(A - Ai I)x — y, or 
equivalently, by the partial fraction decomposition, two independent linear 
systems (A - A il)xi = y, i = 0, 1. To make the other processors participate 
in solving these systems, we will have each of the three ‘master nodes’ 001, 
010, and 111 distribute some of the linear systems ( A - A */)z* = y to slave 
processors. To do this in a systematic manner, each master processor will 
assign the system ( A — A il)x{ = y to the (slave) processor whose label has 
the same leading 2 bits as those of the master node and the same trailing bit 
as that of the binary representation of i. Thus, node 001 sends the system 
associated with A 0 to processor 000, and keeps the system with A*. Similarly, 
node 010 sends the system associated with X\ to node 011 and node 111 sends 
the system associated with Ao to node 110. After these systems are solved 
they must be combined back in their master nodes. 

More generally, at a given step of the second regime processors numbered 
F p (j x 2 r ), j = 1, 2, , 2 fc-r - 1 and only those processors, will have to solve 
systems of the form 

(2.18) (A — X 0 I){A — Aj J) . . . (A — A 2 »—i_i)£ = y- 
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When r -- Jfc - p + 1, all the processors in the subcube of th<* nodes whose 
leading p — 1 bits are identical with those of the master node F p (j2 ), will 
have no system of the form (2.18) to solve. At any step r = k - p+ r 0 , where 
r 0 = 1,2,.. serves as iteration counter for the steps of the second regime, 
each master node F p (j2 r ) will have a system of the form (2.18) while all 
other nodes in the subcube of the nodes having the same p- r 0 leading bits, 
will have none. We can use the partial fraction expansion of Section 2 to 
decompose (2.18) into 2 r ' 1 independent linear systems 

(2.19) {A- \ x I)xi = y, i = 0,l,.. M 2 r_1 - 1, 

which should be followed by a linear combination of the x^s. Then, a simple 
strategy to distribute the linear systems (2.19) equally, is to have the master 
node broadcast the right hand side y to the nodes of its subcube consisting 
of all the nodes with the same leading p — r o bits, and have each of them 
solve the systems (2.19) for which the trailing r 0 bits of i match the trailing 
r 0 bits of the node’s label. 

In brief we have used the fact that the sets Sj consisting of the nodes f 
so that _F P (£) = F p (j2 r ), form a partition of the p-cube into subcubes, and 
we have distributed the linear systems (2.19) equally in each subcube. 

Therefore, if we denote by L,(i) the binary number consisting of the s 
trailing bits of the binary number t, we can summarize the rule by 

Load Balancing Rule: At step r — k - p + r 0 , each master node 
F p (j2 r ) makes slave nodes F p _ ro (j2 r )L To (i) solve systems (2.19). 

Note that there are p - 1 steps in the second regime. It is easy to see 
that each node will solve an equal number of linear systems which is constant 
and equal to 2 r_1_ro = 2 k ~ p ~ l , at the exception of a small number of the 
last nodes in the Gray code sequence which will have no system to solve. 
This process requires broadcasting of the right-hand-side in node J to its 
subcube and then gathering/ summing of the data in the manner distributed 
inner products are usually computed in a hypercube. With the simplest 
broadcast algorithm, these operations cost O(r 0 m) which means that the 
communication overhead in this second phase is higher than that of the 
first phase where the right hand sides y are formed. Hence depending on 
the relation between communication and arithmetic costs of the particular 
hypercube system it might be preferable not to distribute the work to all 
processors but only to those located in a smaller subcube. 

A final operation required in the second regime is to accumulate the 
different solutions back to the master node. This is essentially a gather- 
combine operation and is done by exploiting the topology of the hypercube, 
in r 0 steps consisting of moving in a higher subcube closer to the master 
node and adding the intermediate results at each time. 
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To get an estimated time of the r-th step of this algorithm, we must 
distinguish between the two regimes. We first observe that the operation 
that consists of combining the vectors p’s and q's to form the right hand 
sides is identical in both regimes except that we deal with more vectors in 
the first and that only the master nodes are actually active in the second. 
We can also view the right-hand sides in the first regime as forming a single 
long right vector of length m* = 2 k ~ r ~ p X m. Thus, the first operation in 
step r is to have each processor with label of the form J = F p [j X 2 r ] form 
a linear combination of q and p vectors of length m r to form their column of 
the right-side matrix Y. This costs 2/3 + 2m'r for communication (Bringing 
two vectors that are at most two hops away) and 2j + 2 m f u> for their linear 
combination. 

After the right-hand-sides are formed we need to solve the tridiagonal 
systems. This is processed differently in the two regimes. In the first regime, 
we need to solve in each processor independently, at most 2 T tridiagonal 
systems with 2 k ~ r ~ p right-hand-sides. Using standard Gaussian elimination 
this will consume a time of 

(2.20) 2 r x m x [37 + 8 x 2*~ r ~ p u>] 

assuming vectorization across the right-hand-sides. 

On the other hand regime 2 requires first broadcasting the (single) right- 
hand-side to its subcube at the cost of 

r 0 X (/3 + mr), 

and then having each of the slave processors solve 2 r_1_r ° = 2 k ~ p ~ l tridiag- 
onal systems at the cost of 

(2.21) 2 k ~ p ~~ l x m X [37 + 8 X a;] 

Moreover, the different solutions must be accumulated back to the master 
node at the cost of 


tq\P + mr + 7 + m X u>]. 

Finally, we need to update the vectors p and q according to (2.4), (2.5). 
Note that the vectors involved in updating p^ 1 ^ are same processor, so 
there is no need for communication. The number of vectors that processor 
92 j. 2 r combines is 2 k ~ r ~ p ~ l and the cost is 2 /c ~ r “ p “ 1 (7 + mu;) for (2.4). For 

(2.5), we need to bring the vectors cos * + mT )> 

then do the linear combination at the cost of 27 + 2mu;. These computations 
are identical in both regimes. Note that many processors will be idle at the 
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end of the elimination phase and load balancing can also be performed, but 
we omit the details. 

Summing up we find that each step in the first regime costs a total of 

( 2 . 22 ) T comm = 4/3 + 2 fc - r - pH mr 
for communication and, 

(2.23) T arith = (5 + 3 X + (4 + 2 k ~ r - p+1 + 2*- p+2 )mu 

for arithmetic. Similarly, each step in the second regime costs 

(2.24) Tcomm — ( r 0 + 4)(/3 + mr) 
for communication and 

(2.25) T ar ith = (5 + 3x 2 fe ~ p “ 1 ) 7 + (r 0 + 6 + 2*- p+ 2 )mu; 

and arithmetic. Note the high coefficient in front of the 7 term in the second 
regime which simply indicates that there is no vectorization when solving 
the tridiagonal systems. It is rather difficult to compare the algorithm of 
Section 2.1.1 with the one described in this section based on these com- 
plexity results. Assuming that m = n, we only note that when the latter 
is dominated by the first regime, i.e., when n is very large compared with 
P then the second algorithm is likely to be slightly superior than the first. 
This can be seen by comparing all the coefficients of each of the constants 
/3 ,t ) 7 ,u? amd making the simplification Ps % 2 k ~ l which is valid for large n 
and small r, i.e., for the first reduction steps. If the algorithm is dominated 
by the second regime, i.e., when then n is of the same order as P and so 
there are only a very small number of steps in the first regime, then the 
arithmetic times of the two algorithms are comparable except that there are 
many more start-ups with the second. On the other hand, communication 
is less costly with the second algorithm. 

2.2. A massively parallel block cyclic reduction algorithm. In 

this section we consider the extreme case where the number of processors 
is larger or equal to N = mn the number of grid points. In fact, consider 
the simple case where m = 2 Pl — l,n = 2™ — 1 and p = p\ + p?<> i-e., 
P — m(n + 1 ) > N. Each of the grid points is assigned to a different node 
and corresponds to a well-known mapping of a two-dimensional grid into 
the hypercube as illustrated in Figure 6 . Thus, the physical grid is mapped 
into an array of processors imbedded in the hypercube. In terms of the 
right-hand-side vectors fj the mapping rule is as follows: 

Map3: Assign component fjj to node 
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FlG, 6, Two dimensional Gray-code for on 8 x 4 grid . 


Hence the subvectors / t ,i — 1,2, are stored vertically, each in one 
vertical line of the array depicted in 6. Each horizontal line contains the 
same components of these subvectors. It is clear that if standard BCR were 
used, then the linear combination of the subvectors causes no difficulty and 
would require communication only between nodes that are on the same hor- 
izontal lines. Solving the tridiagonal systems requires interaction between 
grid points of the same subvectors, i.e M communication between nodes that 
are on the same vertical lines. If we use standard Gaussian elimination then 
only one of the nodes of each vertical line will be busy at any given step of 
the forward and backward sweeps. An obvious solution to this difficulty is 
to use scalar cyclic reduction. An alternative would be to use the parallel 
variant of cyclic reduction known as PARACR that avoids the back substi- 
tution phase ([12]). It was observed however that because of the additional 
communication involved in PARACR, cyclic reduction is more economical 

([ 14 ])- - - " - - -■ ^ 

Without going into the complex details of the algorithms, we would like 
to estimate the order of the time required to execute it. We can adapt 
the algorithm of Section 2.1.2 to this case, except that we are now only 
considering the special situation where there is no first regime. There are 
a total of Jb = j>2 steps. Each step starts by assembling the right-hand- 
sides, one for each column of the grid corresponding to the nodes %ji r ' This 
requires communicating a few elements with processors that are at most 
2 hops away in the x-direction, and is a constant with respect to r. The 
tridiagonal systems to be solved next will require to broadcast the right- 
hand- sides to the subcube associated with the master nodes, at a cost roughly 
proportional to the step number r. Assume that this time plus the time 
required to sum up the partial results later on, back to the master node is 
of the form a^r. Then the tridiagonal systems are solved independently at 
roughly a cost of the form a t pi . The rest of the operations on p’s and g’s 
are again constant with respect to r. If we call c the total of all the times 
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that are constant with respect to r, each step will cost approximately 


(2.26) c + a 6 r + a t p! 
which yields a total over the P 2 - 1 steps of 

(2.27) T} scr ~ P2[c + ^ a tJ>i] 

In other words the cost is of the order of 0[log 2 n + log m log n]. The term 
log 2 n comes entirely from the use of the partial fraction expansion and 
appears because of the need to load balance the computation. 

It is interesting to observe the striking difference between the parallel 
complexity of BCR and FFT based algorithms. A simple application of the 
2-D FFT algorithm (for Poisson’s equation) or of the combined FFT and 
tridiagonal solve algorithm (for (2.1)) in the hypercube can yield a paral- 
lel arithmetic complexity of 0(logn + logm), i.e., the sum of the times for 
implementing FFT in the one direction and FFT or scalar cyclic reduction 
in the other. On the other hand the parallel version of BCR described here 
cannot achieve a time better than O(lognlogm). That this is an actual 
limitation of the algorithm rather than just a consequence of the implemen- 
tation being used can be established by looking at the dependency graph of 
the parallel BCR. 

However, as is shown in the next section, there exist alternative algo- 
rithms with a complexity of order 0(logn + logm), which is comparable to 
that of FFT based methods. 


3* Explicit Methods. We now present an algorithm for the solution 
of (2.2) having parallel complexity 0(logn + logm) which is based on using 
the explicit inverse of A 

Denote by S n (x) the shifted Chebyshev polynomial of degree n of the 
second kind defined for 0 < x < 2 by 


5„ W , Mg± M,co.0=./2. 


sin 


e 


(3.1) 

The explicit inverse A~ l can be written in block form, with block (i, j ) 
given by, ([l]) 


(3.2) 


/ S~ l (A)S i . 1 {A)S n _ j (A ), j>i, 
\ S- 1 (A)S j _ 1 (A)S n _,(A), i>j 


As a result we can write the solution u explicitly. The i th block component 
is obtained by multiplying A -1 by / blockwise, yielding: 


Ui 




3 - 1 
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= ^5- 1 ( J 4)5 j _ 1 ( J 4)5 n _ i (A)/ i 
i= i 

+ 5^ ^TJ 1 (-^) , S'»-l(^)^T»-i(^)/j 

i=* 


It is clear from (3.2) that each block A can be expressed as a rational 
matrix function qij{A) with denominator of degree n and numerator of degree 
less than n. As a result we can express each of these as the sum of n 
elementary fractions. Let the coefficients \ be the expansion coefficients 
of each of these rational functions with respect to the elementary fractions. 
In other words, let, 


.(*) 


( 3 - 3 ) Y t TZT k ~ 9ii( ' x) 


k= 1 


f S~ 1 (x)Si-i(x)S n _j(z) for j > i, 
\ S- 1 (x)S j _ 1 {x)S n ^ i (x) for i > j 


Then each of the components u, is given by 


Ui = 


j= 1 

it 

j=i 

E e - wy'fi 

j - 1 k= 1 

±{a-x >ir'±4? fj 

k - l j = i 


which results in the algorithm of Figure 7 for computing the solution. 

From the above description the parallel computational complexity fol- 
lows easily: Step (I) requires log n operations, if n 3 m processors are available. 
The complexity for step (II) depends on the algorithm chosen to solve the 
systems. Since A is tridiagonal, scalar cyclic reduction can be used at a 
cost of O(logm) and n 2 m processors. Step (III) can be completed in logn 
operations with n 2 m processors. Summarizing, the algorithm has parallel 
computational complexity 

Tees = 0(logn + logm). 

To achieve this bound, 0(n 3 m) processors are necessary. The sequential 
complexity of the algorithm is 2 n 3 m + 0(mn J ) operations, which is much 
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Compute the n roots Xk of 5 n (i) in (3.1). 

(k) 

Compute the coefficients or - ; in (3.3). 


DOALL i = 1, n 


DOALL k = 1, n 


compute = E"=i a i ffj 

(I) 

compute Uik = {A - 

(II) 

END DOALL 

(III) 

compute Ui = G;fc 


END DOALL 


Fig. 7. Algorithm: Explicit Elliptic Solver 


higher than the usual 0(mn log n) of fast methods. As was just seen, the 
dominating cost is the computation of the intermediate vectors in the 
innermost loop. This complexity is comparable to that of a direct banded 
solver that does not exploit the special structure. 

Note that the algorithm is as general as the BCR algorithm. It has the 
added benefit that it is very simple to program and that unlike BCR it does 
not require a special treatment when n ^ 2 k — 1. 

We must point out however that the factor O(logn) lower complexity of 
the explicit algorithm compared with that of parallel BCR came at a very 
high price, namely a factor of 0(n 2 ) increase in the number of processors 
required. As we show next, by taking advantage of the structure of the 
problem, we can lower this processor requirement by a factor of O(n). 

We first observe that each of the coefficients is defined by 


(3.4) 


ffc) Sj_i(\k)S n -i(\k) 

* ij - S' n { X h ) 


for i > j 


with, the corresponding formulas for i < j defined by interchanging i and j. 
Therefore, we can write 


sin(j0fc) sin((n + 1 - i)0k) 
sin 2 0 k Shi^k) 

(3 {k) 

sin 2 0*S^(A fc ) 


= sin (jO k ) sin((n + 1 - i)0 k ) 


where we have set 
(3.5) 
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with cos 9k = | Afc. We now use a well known trigonometric formula to obtain 
for i > j 

(3.6) pW = i[cos((n + 1 - (i + j))#k) - cos((n + 1 - (i - i))0fc)] 

To get the formula for the case i < j we need only interchange the indices i 
and jf, leading to the general expression 

(3.7) = ^[cos((n + 1 - (i + j))0k) - cos ((n + 1 — |* — j|)^fc)] 

One then notices that the matrix 2fW — is the sum of the 

Hankel matrix 

{i cos(n + 1 - (* + j))0k}ij = i n 

and the Toeplitz matrix 

^ COS (^ T 1 | f j | = 

It follows that the matrix A W = (aj^) * s a ^ so SUIT1 °f a Hankel and a 
Toeplitz matrix. 

Observe that each of the m coordinates of the subvectors of f i } in the 
previous algorithm, is the result of the product of the matrix A ^ by the 
vector obtained by extracting all the corresponding coordinates from the 
vectors fj. In fact another way of expressing this is by writing that 

( 3 . 8 ) ( F {k) f = A^F t 

where F ^ and F are mxn matrices whose column vectors are the /^’s and 
s, respectively. Each of these products represents the product of a Toeplitz 
plus a Hankel matrix times a vector of size n. We write this as (7\ + H)x 
with T\ Toeplitz and H Hankel. By definition, H can be rewritten as JT 2 , 
where J is the permutation matrix consisting of ones in the antidiagonal 
and 0 everywhere else. Hence the operation becomes (Tj + JT 2 )x. T\x 
and T 2 x are two Toeplitz matrix vector products computable by four FFT 
transforms of size 2 n each, at the cost of O(logn) arithmetic operations 
and 2n processors per FFT. J is a permutation matrix and hence the only 
other computation required is the parallel addition of the two partial results 
requiring n processors. 

This must be multiplied by the number of components in each subvector, 
and by n, the number of roots which leads to 2 n 2 m processors. The rest 
of the algorithm proceeds as before and requires O(logmn) operations with 
0(mn 2 ) processors. We are thus led to the following theorem. 

THEOREM 3.2. Using 0(mn 2 ) processors, system (2.2) can be solved in 
0(logn + logm) parallel steps. 
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4. General Separable Equations. A second order finite difference 
discretization of equation (1.1) with Dirichlet boundary conditions leads to 
the block tridiagonal matrix A 


A + ail 

7i/ 


\ 


A + a 2 I 

72 / 




A + afc/ 7 fe/ 




A»-i/ A + a„_i/ 

7n-l/ 



(U 

A + a n I ) 


We summarize how the techniques we have described so far can deal 
with this case. As we shall see, this generalization comes at the cost of 
extra preprocessing overhead due to the need to compute polynomial roots 
or (equivalently) computing eigenvalues, which in practice may make these 
methods less attractive. 

Block cyclic reduction has been extended by Swarztrauber ([22]) to han- 
dle (1.1). The difficulty here is that the roots of the polynomials generated 
in the course of the reduction are not analytically available as is the case for 
(2.1). As a result, in order to perform the partial fraction decomposition, 
they must be computed numerically. In [22, Table 2] there are timing results 
which show that this preprocessing overhead can be overwhelming. 

Although (1.1) is the most general form we can have, it is not the most 
convenient to work with since it can lead to non- symmetric matrices. This 
is overcome by rewriting (1.1) (if the coefficients allow) or transforming it 
(multiplying by suitable integrating factors) to self-adjoint form. We thus 
assume next that such a transformation has been made and hence fii = 7i_i 
for i — 1 , . . . , n and A = A T . 

The matrix decomposition algorithm described in [4] is based on the 
eigenvalue-eigenvector decomposition of the diagonal blocks of A . When the 
equation is of Poisson type in one direction, a suitable ordering of the un- 
knowns makes the orthogonad matrix of eigenvectors of A equal to a discrete 
Fourier transform operator thus allowing the use of FFT for the matrix- 
vector multiplications. For the above more general matrix A however this 
is no longer true: The eigenvalues and eigenvectors of A must be computed 
numerically and matrix- vector multiplications must be performed explicitly. 
In the experiments described in [22], the generalized BCR seems to perform 
better than matrix decomposition. 

The generalization of the explicit method of Section 3 requires the use 
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of a formula analogous to (3.2) for A. 1 . From [2, Theorem 2.2] 


(4.1) 


P~\A)P i . i (A)R n . j (A), j>i, 
P~ i (A)P j _ 1 (A)R n _ i (A), i>j 


where the n roots of P„ are the negatives of the eigenvalues of the sym- 
metric tridiagonal matrix Tridiag[/3,_i, a,, /3,]. The polynomials P, , Rj are 
computable by means of three term recurrence relations. Hence once the n 
eigenvalues of Tridiag[/3,_!, a;, /3 t ] have been computed the algorithm can 
proceed as in Section 3. 

5. ADI Algorithms. The Alternating Direction Implicit procedure of 
Peaceman and Rachford can be regarded as a fast algorithm for solving 
separable elliptic equations. Although this is an iterative method, if the 
equations resulting from the discretized partial differential equations are 
solved with an accuracy that is of the order of the discretization error, and 
if the optimal parameters are used, then the number of steps required for 
convergence is of the order of O (log 2 n), where n is assumed to be the larger of 
the two numbers of grid points in the x and y directions. This puts the total 
cost in the sequential case to 0(nm log 2 n) [20]. Hence if one can achieve a 
parallelism of the order of nm then there is the possibility of reducing the 
cost of ADI to the same order as what we obtained with BCR in the best 
case. Several of the benefits of ADI have been mentioned in the literature 
for the general nonseparable case or the parabolic equation case ([10,14,15]). 
Here we would only like to mention some of the implementation aspects and 
discuss some advantages over BCR. 

To describe the basic algorithm, consider the partial differential equa- 
tion, 


(5.1) 


d 

dx 




on a rectangular domain with the Dirichlet boundary conditions. 

If the equations sire discretized using a mesh of n + 1 points in the x 
direction and m-(- 1 points in the y direction we get the system of equations: 


(5.2) A x u + B y u = f 

in which the matrices A x and B v represent the 3-point central difference 
approximations to the operators -^(a{x, t/)J^)) and -§^{b(x, y)§^)) respec- 
tively. 

The ADI algorithm consists of iterating by solving (5.2) alternatively in 
the x and y directions as follows: 

(A x + p,/)u <+1 / 2 = ( Pl I - ByW + / 


(5.3) 
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FlG. 8. Domain decomposition and assignment of the square into a 4-processor ring . 

(5.4) (B y + p t A x )u i+1 = ( Pi I - A x )u i+l/2 + /, 

where p*, i = 1 , 2 , is a sequence of acceleration parameters. 

In the following we summarize some of the results described in [15] with 
emphasis on complexity. Observe that if the mesh points are ordered by lines 
in the x direction, then ( 5 . 3 ) constitutes a set of m independent tridiagonal 
systems of size n each. The system (5.4) can also be recast into a set of 
n independent tridiagonal systems of size m each, by reordering the grid 
points by lines, this time in the y direction. This requires essentially to 
transpose the matrix of the n X m grid points and constitutes the main 
difficulty in implementing ADI on parallel architectures. Another difficulty 
that has been traditionally associated with ADI is that classical algorithms 
for solving tridiagonal systems are sequential in nature. 

Consider first the implementation of ADI on a simple ring of processors. 
To avoid transposing data in ADI as pointed out above, we consider the 
special assignment of the grid points into the ring of processors proposed in 
[15] and shown in Figure 8 for the case of a 4 -processor ring. When iterating 
with ADI, the solutions of the systems (5.3) and (5.4) can be performed by a 
regular Gaussian elimination algorithm. Observe that all processors will be 
performing some work at any given stage of the iteration. Communication 
is facilitated by the fact that all neighboring subsquares of the domain are 
in neighboring processors and this is true in both the horizontal and vertical 
direction. The mapping can be succinctly described by 

Map4: Assign component fj % i to node mod[[^] + [—] - 2,P] + 1 
Using the same model for estimating execution time as in Section 2.1, 
with 7 = 0 , a simple complexity analysis shows that the time for implement- 
ing such an algorithm on a ring of P processors is [15] 

_ , x 8 mn 

T(P) = 2 (P - l)/3 + 2 (m + n)r + 
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If p is small compared with m and n, the above formula shows that the op- 
timal speed-up of P is nearly reached provided the communication constants 
/3, r are not too big. However, as the number of processors increases the com- 
munication time may become too high. In fact for the case m = n, the mini- 
mal time that can be achieved on an arbitrarily large ring is 4(2t/J$lJ+2t )m, 
see [15] which is linear in m. 

The next simple architecture to be considered is that of a two dimen- 
sional grid. In [15] it was shown for the case m = n, that mapping the n 2 
grid points of the square homographically into a k X k grid of processors, 
and using a substructured Gaussian Elimination [10,19], the total time for 
one of the solution steps in ADI is of the form 

T g {P) « ay + + V'/P + 0(1). 

where a, £, r) are constants independent of P. The minimum time for an 
arbitrarily large processor grid is of the form 0(n 2 / 3 ). Multiplying this by 
the number of steps which is 0( log 2 n) we arrive at an asymptotic complexity 
of 

Topt.Grid = 0(n 2/3 log 2 n) 

compared with 0(n 2 log 2 n) in the sequential case. Note that the number of 
processors to achieve this optimal time is 0(n 4 / 3 ). 

We next consider the implementation of ADI on hypercubes. We sim- 
plify the exposition by assuming again that m = n. We use the same 
mapping as before by embedding the 2-D grid into the hypercube as was 
described in Section 2.2. Then, scalar cyclic reduction is employed to solve 
the successive tridiagonal systems in the algorithm. Assume that the 2-D 
mesh is first subdivided into small {ti/k) X (n/«) squares and that the sub- 
square in position (i, j) is assigned to processor (t, j) of the grid. Then each 
of the solve phases in ADI amounts to solving in each row or column of the 
grid ti/k independent tridiagonal systems each of which is split into k equal 
parts. 

Consider the process on each of the ti/k tridiagonal systems separately. 
Each of the first log(n//c) steps of cyclic reduction requires only communica- 
tion between neighboring processors in which a fixed number of elements is 
transmitted to neighbors namely 4 elements from each direction. The total 
time for arithmetic operations of the forward and backward sweep is 0(ti/k) 
since it is similar to that of performing the cyclic reduction algorithm on a 
tridiagonal system of size ti/k on a single processor. After these log(n/*c) 
first steps are completed, each processor will end up with one equation of a 
kxk tridiagonal system. Cyclic reduction on such a system can be performed 
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in time 0(log/c) thanks to the fact that the distance between equations i 
and i + 2* is constant due to the assignment using Gray codes [13]. 

The total time for all ti/k systems is of the form OQ 2 ) + O(^logP). 
Observe that for the maximum allowable value of P, P = n we get a time of 
the form O(logP). Therefore, a logarithmic time in n is achievable for each 
step of ADI with the hypercube topology. Moreover, the total time over the 
0(log 2 n) steps required for convergence would become T^hyp = 0(\og n) 
which does not compare favorably with the 0( log 2 n) of the hypercube BCR 
described earlier. On the other hand the implementation of ADI is far sim- 
pler than the parallel BCR. All that is required is to implement efficient 
multiple tridiagonal solvers in the x and y directions. Moreover, ADI is 
more general than block cyclic reduction, although the theory for nonsepa- 
rable problems does not provide the optimal parameters and the number of 
steps may be much higher than what is obtained with separable problems. 

6. Conclusions. We have proposed several parallel implementations 
of fast algorithms for solving elliptic equations. As was shown in [8] the 
block cyclic reduction algorithm using partial fraction expansions leads to a 
viable and efficient approach for computers with small numbers of proces- 
sors. In this paper we have also considered the case where the number of 
processors is large compared with the problem size. One common feature 
of all the different variants is the complexity of their implementation. This 
problem becomes even more acute when the problem dimension is not care- 
fully chosen (e.g. n = 2 k - l for BCR). It is not clear to us whether much 
simpler algorithms should not be preferred even at the expense of sacrificing 
some efficiency. The explicit methods described in Section 3 are simpler to 
implement but require far too many processors to be of practical use for 
problems of reasonable size. The alternative of the ADI techniques consid- 
ered in Section 5 constitutes a good compromise between efficiency and ease 
of implementation . 
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